-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Fix struct casts to align fields by name (prevent positional mis-casts) #19674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Route ColumnarValue::cast_to through a name-based struct casting path for both array and scalar struct values. Introduce a helper to reorder struct children by target field names, insert nulls for missing fields, and recursively cast each child with Arrow options. Add unit tests to verify struct field reordering and null-filling for missing fields when casting between struct schemas.
…::cast_to Add comprehensive documentation explaining that struct casting uses field name matching rather than positional matching. This clarifies the behavior change for struct types while preserving existing documentation for other types. Addresses PR review recommendation #4 about documenting public API changes.
Replace .clone() with Arc::clone() to address clippy warning about clone_on_ref_ptr. This makes the ref-counting operation explicit and follows Rust best practices. Fixes clippy error from rust_lint.sh.
## Problem The PR to fix struct casting (issue apache#14396) introduced regressions where struct casting with field additions/removals was failing, and name-based field matching wasn't working correctly in all scenarios. ## Root Causes Identified 1. **Field index mismatch**: cast_struct_array_by_name was using field indices from the DataType instead of the actual StructArray, causing wrong column access when field names didn't match physical layout. 2. **Missing fallback logic**: When source and target had no overlapping field names (e.g. {c0, c1} → {a, b}), name-based matching failed silently, returning NULLs. Added fallback to positional casting for non-overlapping fields. 3. **Optimizer const-folding issue**: ScalarValue::cast_to_with_options was calling Arrow's cast_with_options directly, which doesn't support struct field count changes. The optimizer's simplify_expressions rule would fail when trying to fold struct casts at compile time. 4. **Validation rejection**: The logical planner's can_cast_types check rejected struct-to-struct casts with mismatched field counts before execution. Added special handling to allow all struct-to-struct casts (validation at runtime). ## Solution - Created datafusion/common/src/struct_cast.rs with shared name-based struct casting logic for both runtime (ColumnarValue) and optimization-time (ScalarValue) - Updated ScalarValue::cast_to_with_options to use the name-based struct casting - Updated ColumnarValue::cast_to to use the shared logic - Updated Expr::cast_to validation to allow struct-to-struct casts - Added fallback to positional casting when field names don't overlap - Fixed struct array field access to use actual StructArray fields, not DataType fields - Updated tests to reflect new behavior and correct syntax issues ## Behavior Changes Struct casts now work correctly with: - Field reordering: {b: 3, a: 4} → STRUCT(a INT, b INT) → {a: 4, b: 3} - Field additions: {a: 1} → STRUCT(a INT, b INT) → {a: 1, b: NULL} - Field removals: {a: 1, b: 2, c: 3} → STRUCT(a INT, b INT) → {a: 1, b: 2} - Fallback to positional casting when no field names overlap ## Files Modified - datafusion/common/src/lib.rs: Added struct_cast module - datafusion/common/src/struct_cast.rs: New shared struct casting logic - datafusion/common/src/scalar/mod.rs: Use name-based struct casting - datafusion/expr-common/src/columnar_value.rs: Delegate to shared casting logic - datafusion/expr/src/expr_schema.rs: Allow struct-to-struct casts through validation - datafusion/sqllogictest/test_files/struct.slt: Fixed tests and added new ones
…ype comparison This aligns the type coercion logic with the new name-based struct casting semantics introduced for explicit CAST operations in issue apache#14396. Changes: - struct_coercion in binary.rs now attempts name-based field matching first - Falls back to positional matching when no field names overlap - Requires matching field counts for successful coercion - Preserves left-side field names and order when using name-based matching This fixes cases where CASE expressions with structs having the same fields in different orders now correctly match fields by name rather than position. Note: Several CASE expression tests with struct field reordering are disabled due to const-folding optimizer hang when evaluating struct literals with different field orders. This requires separate investigation and fix.
…-folding
This fixes the TODO tests in struct.slt that were causing optimizer hangs
when attempting to const-fold struct literal casts with field count mismatches.
Changes:
1. Modified expr_simplifier.rs can_evaluate() to skip const-folding for
struct CAST/TryCAST expressions where source and target have different
field counts. This prevents the optimizer from attempting to evaluate
these at plan time, deferring to execution time instead.
2. Modified cast.rs cast_with_options() to allow all struct-to-struct casts
at physical planning time, even when Arrow's can_cast_types rejects them.
These casts are handled by name-based casting at execution time via
ColumnarValue::cast_to.
3. Uncommented and fixed TODO tests in struct.slt:
- CAST({a: 1} AS STRUCT(a INT, b INT)) - adds NULL for missing field b
- CAST({a: 1, b: 2, extra: 3} AS STRUCT(a INT, b INT)) - ignores extra field
The fix ensures that:
- Optimizer doesn't hang trying to const-fold unsupported struct casts
- Physical planner accepts struct-to-struct casts with field count changes
- Execution uses name-based casting to handle field reordering and NULLs
This uncomments and fixes the TODO tests for struct coercion in CASE expressions
that were previously disabled due to concerns about optimizer hangs.
Changes:
1. Uncommented the TODO test section for struct coercion with different field orders.
Tests now verify that name-based struct coercion works correctly in CASE expressions.
2. Updated test expectations to match actual behavior:
- When THEN branch executes, result uses THEN branch's field order
- When ELSE branch executes, result uses ELSE branch's field order
- Struct coercion requires equal field counts - mismatch causes planning error
3. Added explicit test for field count mismatch case:
- Verifies that coercing structs with different field counts (2 fields vs 1 field)
correctly fails during type coercion with appropriate error message
The tests now pass because:
- The optimizer fix from the previous commit prevents const-folding hangs
- Name-based struct coercion in struct_coercion function handles field reordering
- Type coercion correctly rejects field count mismatches during planning
Remove duplicate struct_cast.rs module and use the existing nested_struct::cast_struct_column implementation instead. This eliminates code duplication and provides a single source of truth for struct field-by-name casting logic. Changes: - Add public cast_struct_array_by_name wrapper in nested_struct.rs - Update columnar_value.rs to use nested_struct::cast_struct_array_by_name - Update scalar/mod.rs to use nested_struct::cast_struct_array_by_name - Remove struct_cast module from lib.rs - Delete datafusion/common/src/struct_cast.rs Benefits: - Single implementation to maintain and test - Consistent behavior across all struct casting operations - Reduced maintenance burden for future bug fixes - Better code cohesion in nested_struct module
Implement name-overlap detection with positional fallback and clearer ambiguity errors for non-overlapping cases. Enhance unit tests and SQLLogicTest coverage to include tests for positional struct casting and scenarios lacking name overlap.
Implement null fast paths for struct casting and compatibility checks to return null struct arrays for NULL-only inputs. Allow NULL source fields during validation. Add a unit test covering the scenario of casting a NULL struct field into a nested struct target.
Return plan errors directly from optimizer rule failures and update the related test expectations. Relax the SQLogicTest expectation for struct cast mismatches to allow for either plan-error prefix variant.
…ind 'parquet' feature
Consolidate duplicated logic in cast_struct_column by using a single loop for both name-overlap and positional mapping cases. This change selects the source child by either name or position and centralizes the cast-and-error handling, improving code clarity and maintainability.
adriangb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Requesting some changes, mostly minor. I do think this is the sort of PR where for next time it would be easier to review if the orthogonal changes were isolated into their own PRs.
| if source_field.data_type() == &DataType::Null { | ||
| return Ok(()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the check below still needs to run. What if the target field is not nullable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended
| // Allow struct-to-struct casts even if Arrow's can_cast_types rejects them | ||
| // (e.g., field count mismatches). These will be handled by name-based casting | ||
| // at execution time via ColumnarValue::cast_to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we be applying the same rules here? It seems unfortunate that a lot of cases will succeed at planning time but fail at runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended.
| /// | ||
| /// # Errors | ||
| /// Returns an error if the source is not a struct array or if field casting fails | ||
| pub fn cast_struct_array_by_name( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this really necessary to add to the public API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refactor and removed this.
| if source_col.data_type() == &DataType::Null | ||
| || (!source_col.is_empty() && source_col.null_count() == source_col.len()) | ||
| { | ||
| return Ok(new_null_array( | ||
| &Struct(target_fields.to_vec().into()), | ||
| source_col.len(), | ||
| )); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems unnecessary if called from cast_column. Maybe document that this is needed if cast_struct_column gets called directly (if it ever is)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
amended cast_column to remove the duplicate null check.
| if matches!(e, DataFusionError::Plan(_)) { | ||
| return Err(e); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change for all optimizer rule failures right? Seems a bit broad to include in this PR. Maybe we could merge it as its own commit first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this.
|
|
||
| /// Return true if every left-field name exists in the right fields (and lengths are equal). | ||
| fn fields_have_same_names(lhs_fields: &Fields, rhs_fields: &Fields) -> bool { | ||
| let rhs_names: HashSet<&str> = rhs_fields.iter().map(|f| f.name().as_str()).collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think uniqueness of field names in structs is enforced elsewhere, but maybe we could add a comment here saying as to why we don't have to worry about struct<c1 intt> -> struct<c1 int, c1 int>`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added comment
| Expr::Cast(Cast { expr, data_type }) | ||
| | Expr::TryCast(TryCast { expr, data_type }) => { | ||
| if let ( | ||
| Ok(DataType::Struct(source_fields)), | ||
| DataType::Struct(target_fields), | ||
| ) = (expr.get_type(&DFSchema::empty()), data_type) | ||
| { | ||
| // Don't const-fold struct casts with different field counts | ||
| if source_fields.len() != target_fields.len() { | ||
| return false; | ||
| } | ||
| } | ||
| true | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this tested anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added tests
|
Thanks @kosiew I have also tested cast with structs as this PR can help Comet with apache/datafusion-comet#3052 About overlap this is what Spark does when no overlap and field count is different However if there is an overlap |
Add a null-check to ensure target fields are nullable before returning early. If a source field is of Null type but the target field is non-nullable, raise a clear error instead of allowing silent failures. The error message now explicitly identifies both the source and target fields.
…s and validate at planning time
- Added 3 unit tests in expr_simplifier.rs for struct cast scenarios: - Different field counts (should not be const-folded) - Same field count with matching names (can be const-folded) - Same field count with different names (validates name-based casting) - Created SQL logic test file: struct_cast.slt with 16 test cases covering: - Basic struct creation and casting - Field name matching semantics - Mismatched field counts (error cases) - NULL value handling - Nested structs - Field reordering by name - Type conversions in struct fields - Field access after casting
…ounts in SQL logic test
Prevents dimension mismatch when simplifier evaluates 0-row struct literals against its 1-row input batch, fixing test failures.
- Use 1-row struct arrays to enable const-folding evaluation - Assert results are Literal (const-folded) not Cast - Verify simplification occurred (result != expr)
|
Thanks for sharing the scala test results.
In other words, when there is no field name overlap AND field counts differ, Spark treats the entire struct as incompatible and returns NULL rather than attempting positional or partial casting. In this PR, we return an error.
// Source in Parquet: {col1: int, col2: int} spark.read.schema("str struct<col3 int, col4 int, col5 int, col1: int>").parquet("/tmp/test_struct").show(false) When at least one field name matches (in this case,
This PR has the same behaviour. |
4452d56 to
5a4ecb1
Compare
| /// that apply to all columns and optional column-specific overrides | ||
| /// | ||
| /// Closely tied to [`ParquetWriterOptions`](crate::file_options::parquet_writer::ParquetWriterOptions). | ||
| /// Closely tied to `ParquetWriterOptions` (see `crate::file_options::parquet_writer::ParquetWriterOptions` when the "parquet" feature is enabled). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not related to this PR.
Fix to enable CI to pass.
Which issue does this PR close?
Rationale for this change
DataFusion’s struct casting and some coercion paths were effectively positional: when two structs had the same field types but different field orders, casting could silently swap values. This is surprising to users and can lead to silent data corruption (e.g.
{b: 3, a: 4}::STRUCT(a INT, b INT)yielding{a: 3, b: 4}).The goal of this PR is to make struct casting behavior match user expectations by matching fields by name (case-sensitive) and recursively applying the same logic to nested structs, while keeping a compatible fallback for structs with no shared field names.
What changes are included in this PR?
Name-based struct casting implementation in
datafusion_common::nested_struct:struct(1, 'x')::STRUCT(a INT, b VARCHAR)style casts).validate_field_compatibilityand helperfields_have_name_overlap.Ensure struct casting paths use the name-based logic:
ScalarValue::cast_to_with_options: routeStructcasts throughnested_struct::cast_column.ColumnarValue::cast_to: forStructtargets, cast vianested_struct::cast_column; non-struct casts still use Arrow’s standard casting.Type coercion improvements for structs in binary operators / CASE:
Planning-time cast validation for struct-to-struct:
physical-exprCAST planning now validates struct compatibility using the same rules as runtime (validate_struct_compatibility) to fail fast.ExprSchemableallows struct-to-struct casts to pass type checking; detailed compatibility is enforced by the runtime / planning-time validator.Optimizer safety:
Tests and SQL logic tests:
New unit tests covering:
Updated/added
.sltcases to reflect the new semantics and to add coverage for struct casts and nested struct reordering.Minor docs/maintenance:
ParquetWriterOptionsso it doesn’t break when theparquetfeature is disabled.Are these changes tested?
Yes.
Added/updated Rust unit tests in:
datafusion/common/src/nested_struct.rsdatafusion/expr-common/src/columnar_value.rsdatafusion/optimizer/src/simplify_expressions/expr_simplifier.rsAdded/updated SQL logic tests in:
datafusion/sqllogictest/test_files/case.sltdatafusion/sqllogictest/test_files/struct.sltThese tests cover:
Are there any user-facing changes?
Yes.
No public API changes are introduced, but this is a semantic change in struct casting.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.